Feature Generation for Sequence Categorization
Abstract
Sequence categorization is the problem of generalizing from a corpus of labeled sequences to procedures that accurately label future, unlabeled sequences. The choice of sequence representation can have a major impact on this task. In the absence of background knowledge, a good representation is often not known, and straightforward representations are often far from optimal. We propose a feature generation method, FGEN, that creates Boolean features that check for the presence or absence of heuristically selected collections of subsequences. We show empirically that adding the features computed by FGEN to existing representations of sequence data improves the accuracy of two commonly used learning systems, C4.5 and Ripper. We show this improvement across a range of tasks selected from three domains: DNA sequences, Unix command sequences, and English text.
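The idea of Boolean subsequence-presence features can be illustrated with a minimal sketch. It is not FGEN itself: the abstract does not describe how FGEN selects its collections of subsequences, so the selection rule below (contiguous n-grams ranked by how much more often they occur in one class than in the others) is an assumption, and the names `ngrams`, `select_ngrams`, and `boolean_features` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(seq: Sequence[str], n: int) -> Set[Ngram]:
    """Return the set of contiguous length-n subsequences of seq."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def select_ngrams(corpus: List[Sequence[str]], labels: List[str],
                  n: int, target: str, top_k: int) -> List[Ngram]:
    """Assumed heuristic (not FGEN's): score each n-gram by the number of
    target-class sequences containing it minus the number of other-class
    sequences containing it, and keep the top_k with a positive score."""
    score: Counter = Counter()
    for seq, label in zip(corpus, labels):
        for gram in ngrams(seq, n):
            score[gram] += 1 if label == target else -1
    # Sort by descending score, then by the n-gram itself for a stable order.
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, s in ranked[:top_k] if s > 0]


def boolean_features(seq: Sequence[str], selected: List[Ngram],
                     n: int) -> Dict[Ngram, bool]:
    """Map each selected n-gram to True if it occurs in seq, else False."""
    present = ngrams(seq, n)
    return {gram: gram in present for gram in selected}


# Toy Unix command sequences with made-up labels, for illustration only.
corpus = [
    ["ls", "cd", "ls", "vi"],
    ["cd", "ls", "vi", "make"],
    ["mail", "ls", "mail"],
    ["mail", "mail", "ls"],
]
labels = ["dev", "dev", "admin", "admin"]

selected = select_ngrams(corpus, labels, n=2, target="dev", top_k=2)
features = boolean_features(["cd", "ls", "make"], selected, n=2)
```

The resulting Boolean values can be appended as extra columns to whatever representation a learner such as C4.5 or Ripper already receives, which is how the abstract describes the new features being used.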
Related Resources
Feature-Based Learners for Description Logics
Inductive learning algorithms that have been applied to learning in description logics (DL) have not been as well studied and optimized as the more general class of feature-based learning algorithms. This paper proposes a way to apply feature-based learners to DL learning tasks by presenting a method to compute a feature-vector representation for DL instances. The representation is based on con...
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
Unsupervised Feature Generation using Knowledge Repositories for Effective Text Categorization
We propose an unsupervised feature generation algorithm using the repositories of human knowledge for effective text categorization. Conventional bag of words (BOW) depends on the presence / absence of keywords to classify the documents. To understand the actual context behind these keywords, we use knowledge concepts / hyperlinks from external knowledge sources through content and structure mi...
An NTU-Approach to Automatic Sentence Extraction for Summary Generation
Automatic summarization and information extraction are two important Internet services. MUC and SUMMAC play their appropriate roles in the next generation Internet. This paper focuses on the automatic summarization and proposes two different models to extract sentences for summary generation under two tasks initiated by SUMMAC-1. For categorization task, positive feature vectors...
A feature selection technique for generation of classification committees and its application to categorization of laryngeal images
Article history: Received 25 August 2007 Received in revised form 12 August 2008 Accepted 26 August 2008